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Current all-atom potential based molecular dynamics (MD) allow the identification of a protein's functional 
motions on a wide-range of time-scales, up to few tens of ns. However, functional large scale motions of proteins 
may occur on a time-scale currently not accessible by all-atom potential based molecular dynamics. To avoid 
the massive computational effort required by this approach several simplified schemes have been introduced. 
One of the most satisfactory is the Gaussian Network approach based on the energy expansion in terms of the 
deviation of the protein backbone from its native configuration. Here we consider an extension of this model 
which captures in a more realistic way the distribution of native interactions due to the introduction of effective 
sidechain centroids. Since their location is entirely determined by the protein backbone, the model is amenable 
to the same exact and computationally efficient treatment as previous simpler models. The ability of the model 
to describe the correlated motion of protein residues in thermodynamic equilibrium is established through a 
series of successful comparisons with an extensive (14 ns) MD simulation based on the AMBER potential of 
HIV-1 protease in complex with a peptide substrate. Thus, the model presented here emerges as a powerful tool 
to provide preliminary, fast yet accurate characterizations of proteins near-native motion. 

I. INTRODUCTION 

Considerable insight into the biological activity of a protein can be gained by identifying its large-scale functional movements. 
Ideal tools for a detailed characterization of such dynamical properties are constituted by computational techniques such as 
molecular dynamics (MD) simulations based on effective all-atom potentials 1 1]. By these means it is possible, at present, to 
follow numerically the dynamical evolution of a protein of a few hundred residues in its surrounding solvent over time intervals 
of tens of nanoseconds. 

Such time-scales allow to gain considerable insight into important aspects of protein dynamics and to make quantitative con- 
nections with experimental quantities such as NMR order parameters ylyl and Trp fluorescence spectra |4]. Other complex 
conformational changes are however difficult or impossible to be observed. Examples include protein-protein molecular recog- 
nition, rearrangements occuring upon ligand binding etc. which all involve time-scales of the order of 1 /.is or longer[i5||. In 
addition, the simulated trajectory might not be sufficiently long that thermodynamic averages can be legitimately replaced with 
dynamical ones|6]. 

Several studies have attempted to bridge the gap between the time scales of feasible MD simulations and the ones of 
biologically -relevant protein motion by recoursing to a mesoscopic rather than a microscopic approach| 7] . In fact, the large-scale 
dynamical features encountered in MD trajectories can be conveniently interpreted, at a first approximation, as a superposition 
of independent harmonic modes |5]. This observation was complemented by Tirion who pointed out that, in a normal mode 
analysis of protein vibrations, the detailed classical force-field could be replaced by suitable harmonic couplings with the same 
spring constants |7]. These results stimulated a variety of studies where the elastic properties of proteins were described through 
coarse-grained models where amino acids are replaced by effective centroids corresponding to the C a atoms and the energy 
function is reduced to harmonic couplings between pairs of spatially close centroids. These approaches, in particular the Gaus- 
sian and anisotropic network models (GNM and ANM), have been found to be in accord with both experimental and MD results 

ilia. 

Here we introduce an extended network gaussian model which, at variance with previous approaches, incorporates effective 
Cp centroids "tethered" to the C a atoms. The presence of the effective Cp's allows a good control of the directionality of pairwise 
interactions in the protein and thus leads to an improved vibrational description. Furthermore, the fact that the sidechains degrees 
of freedom are entirely controlled by the C a 's has the crucial implication that the computational effort required to characterize 
the system is exactly the same as for models based only on C a 's. 

The model equilibrium dynamics is compared against MD simulations based on all-atom effective potentials. We provide 
several quantitative estimates for how and in what sense, the simplified quadratic approaches can provide a complement of the 
more accurate but also much more computationally-demanding all-atom potential based MD calculations. From a general point 
of view, the ideal term of reference would be constituted by a direct experimental determination of the quantities of interest here, 
such as the correlation of residues' motion. However, since this is not presently feasible, such detailed information can only be 
obtained from MD simulations. Although such simulations do not provide an absolute term of comparison, it is certainly an 
adequate reference for the much simplified protein models and energy functionals considered here. 

The reference system considered here is the complex formed by the HIV-1 protease dimer with a bound model substrate (Fig. 
Q. This protein appears to be suitable to serve as reference for our model calculations in many respects. First, its dynamics in 



FIG. 1: Backbone trace of the complex formed by the HIV-1 protease homodimer and the bound substrate. Each monomer is composed of 99 
residues; The model substrate consists of six amino acids 1 1 ll. ll 31 . 

aqueous solution has been extensively investigated over more than 10 ns by all-atom effective potentials MD simulation fTllfl2ll . 
Second, a large-scale motion analysis based on this MD simulation has been already performed. Third, this protein/substrate 
complex does not contain metal ions, prostetic groups, cofactors or non standard amino acids, and therefore it is amenable as 
a first test system for our model calculations, which consider only standard amino acids. Finally, this protein is of outstanding 
pharmaceutical relevance (it is one of the two targets currently used in anti-AIDS therapy) and its dynamics has been revealed 
as a key ingredient for the enzymatic function and for rationalizing resistance data fTllfl3[fl4lfr5l . 
The data obtained from the MD simulation of refs. 

mill 

have been used here to evaluate all the MD-related quantities 
used for comparison against the results for quadratic models. In support of the general applicability of Gaussian approaches to 
capture the details of proteins essential motion we also report a comparison of the model predictions against data from a recent 
MD study of the NGF-trkA complex formed by the nerve growth factor and the tyrosin Kinase A receptor ! 1611 . 

II. THEORY 
A. The model 

The starting point of the present analysis is the expansion of the Hamiltonian in terms of the deviations of the amino acids 
from their reference native positions. The underlying assumption is that a protein immersed in aqueous solution vibrates around 
its native state with amplitudes so small to justify the quadratic expansion around the minimum of the potential energy function. 
A posteriori this does not seem to be a drastic restriction 11711 despite both the large-amplitude of motion observed in 
dynamical trajectory and the existence of several native sub-states, instead of a unique energy minimum 1 18[. 

To reduce the spatial degrees of freedom of a protein we adopt a coarse-grained model where a two-particle representation is 
used for each amino acid: besides the C a atom, an effective Cp centroid is employed to capture, in the simplest possible way, 
the sidechain orientation in a given amino acid (except for GLY for which only the C a atom is retained). To distinguish the 
proposed model from those based on the C a only representation will shall refer to it as the (3 Gaussian model (/3GM for brevity). 
Since our focus is to study the concerted vibration of various amino acids around the native state of the system, the Hamiltonian 
that is adopted incorporates, accordingly, pairwise interactions between all pairs of particles that are sufficiently spatially close 
in the native state. Formally, the system energy function evaluated on a trial structure, T, takes on the form 

H{T) = Hbb{T) + n aa {T) + n aP {T) + H Pf} {T) (1) 
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In expressions A^ y is the native contact matrix that takes on the values of 1 [0] if the native separation of the effective 
particles of type X and Y, belonging respectively to residues i and j, is below [above] a certain cutoff value, R. With d* Y , on 
the other hand, we denote the actual separation of the particles in the trial structure, T. The indices i and j run over all integer 
values ranging from 1 up to the protein length, N. In particular, the interaction between particles in consecutive amino acids, 
\i — j\ = 1, leads to a simple treatment of the protein chain connectivity. However, to account for the much higher strength of 
the peptide bond with respect to non-covalent contact interactions between amino acids, we have added in Q an explicit chain 
term, H.bb, where the interaction of consecutive C a 's is controlled by k > 0. 

By construction, the minimum of the various interaction terms is attained for the native separation of each pair of particles. 
This ensures that the native state is at the global energy minimum. For small fluctuations around the native structure, the potential 
interaction energy of two particles, i and j, can be expanded in terms of the deviations from the native distance-vector, ry. If 
we indicate the deviation vector as so that the total distance vector is dij — fij + Xij, we can approximate the pairwise 
interaction as 



vfa) « v( nj ) + ^-pi Y. ^4 4 (3) 

^ '71' 

where \i and v denote the cartesian components, x, y and z, and V" is the second derivative of V. Several models have been 
introduced previously, where the quadratic expansion (0 was used in a context where only interactions among C a centroids 
where considered, such as in the anisotropic gaussian model (ANM), recently introduced to study the vibrational spectrum of 
proteins | llj • 

Based on this quadratic expansion the Hamiltonian of eq. Q can be approximated as, 



where M is a 3iVx3-/V symmetric matrix. The elastic response of the system is uniquely dictated by the eigenvalues and 
eigenvectors of M. 

What differentiates the /3GM from several previous studies is the presence of the interactions between C a and Cp and Cp-Cp 
(besides the extra strength of the chain term). The introduction of the Cp centroids in the protein description leads, in principle, 
to a more complicated Hamiltonian, with the additional Cp's degrees of freedom: 



H = - V x CA M CA - CA x CA 
+ V x CA M CA - CB x CB 

~ / J i,£l l], [III 3;V 

However, the location of the Cp atoms in a protein structure is almost uniquely specified by the geometry of the peptide chain. 
An accurate method that predicts the location of the Cp atoms from the CA trace of a protein is the geometric construction of 
Park and Levitt| 19], which assigns the zth Cp location given the positions of the C Q 's of residues i — 1, i and i + 1, allows 
to place the fictitious Cp at a distance of 0.3 A from the crystallographic location. Such excellent agreement clarifies that the 
degrees of freedom of the Cp centroids should not be considered independent from the C a ones. On the contrary, the Cp's can 
be viewed as rigidly "tethered" to the C a and hence the fluctuations of the former are dictated by those of the latter. 

Although in principle one could use the original rule of Park and Levitt 1 19], we have adopted a simpler construction scheme 
which places the Cp exactly in the plane specified by the local C a trace. This simplifies the construction of the M matrices 
which remain "diagonal" in the cartesian components (e.g. the x component of the reconstructed Cp depends only on the x 
components of the neighbouring C^'s. More precisely, the location of the ith Cp is given by 



rcB{i) = rcA{i) + I 



2r CA (i) - r C A(i + 1) ~ t C a^ - 1) 
\2?cA{i) - r C A(i + 1) - r C A{i - 1)| 



(6) 
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For reasons of self-consistency of the model, this construction rule is used to determine the contact matrices involving the 
effective Cp centroids that are used in place of those depending on the crystallographic Cp locations in eqn 0. To leading order 
in the deviations of the C a atoms, the deviations of tcb{i) thus becomes: 



- t-\ i 2x ca{i) - x C a{i + 1) - x C a{i - 1) 

XCb{1) ~ t-TTT^ 7TT ,. 7T J. 7T7 ■ (7) 

\2r C A(i) ~ r C A(t + 1) - r C A(i - 1)| 

where I = 3 A. By using this rule, one parametrizes, in terms of the C a positions the effective Gp location of all residues 
except for GLY and for the initial and final residues which lack one of the flanking C Q 's. When the resulting expressions @ 
are substituted in equation (0 one obtains an effective quadratic Hamiltonian which, as in equation involves only the C a 
deviations but coupled through a new effective matrix, M which is distinguished from previous matrices by a tilde superscript. 
The book-keeping operations necessary to calculate the elements of such matrix are conveniently implemented with the aid of a 
computer. Thus, the computational cost and difficulty to characterize the elastic response of the protein is reduced to exactly the 
same as models with C a atoms only. In spite of the same computational cost, the /3GM appears to have several advantages in 
terms of the ability to capture the low-frequency motion and other vibrational properties of proteins, as will be seen below. 



B. Equilibrium properties 



The derivation of the vibrational properties of a protein from the quadratic expansion of the Hamiltonian can be done, broadly 
speaking, in two different ways: the normal mode analysis and the Langevin analysis. What differentiates the two approaches 
is the view of the role of the solvent on the system dynamics. If one assumes that the motion of the protein is not significantly 
damped by the interaction with the solvent, then the normal modes picture can be a pplie d to study the system dynamics around 
the native state by solving the Newton's dynamical equations 1171 l20l.l2ll l22l.l23H24l.l25l 12611 : 



mi x itl j, = ^ Mij,p V xf* (8) 
i,v 

The eigenfrequecies and eigenvectors are hence obtained by diagonalizing a matrix derived from M by an appropriate mass- 
weighting 12711 . 

Although the normal-mode analysis allows a straightforward dynamical characterization it is of dubious applicability since 
protein motion in a solvent does not resemble a superposition of pure harmonic oscillations. In fact, several theoretical, experi- 
mental and computational studies, have shown that the dynamics of a protein is severely overdamped by the interaction with the 
solvent I2ul23ll28ll29ll . The description of the motion in terms of overdamped dynamics appears to be particularly valid for the 
prot ein's low-frequency vibrations, that are the most interesting ones due to their expected role in proteins functional activities 
12611 . This observation leads to the alternative view of a heavily damped dynamics 1 3011 . 

In this case, for small deviations from the reference positions, the dynamics of the amino acids can be written as: 



ii,At(*) = ~ Mjj^y Xj jV (t) + Tli,n(t) (9) 

where the time unit has been implicitly chosen so that the viscosity coefficients (assumed to be equal for all particles) are set 
equal to 1 and the stochastic noise terms satisfy l3lll : 

(Vi,») = (10) 

( i m,iiVj,v) = Sij s^^kbT . (ii) 

These two conditions ensure, in the long run, the onset of canonical thermal equilibrium, so that the equilibrium probability of a 
given configuration, {x}, for the particles in the system is controlled by the Boltzmann factor: 

e -f3H({x}) = e -P Ey,,/'."^.'"' 1 ^ , (12) 

In this case, no periodic motion of the system can exist in the absence of external periodic excitations, since any structural 
deformation will be dissipated by a damped dynamics. The standard theory of stochastic processes 1 20, 31] shows that the 
eigenvalues of M are inversely proportional to the system relaxation times, and the corresponding eigenvectors indicate the 
actual shape of the associated distortion of the system. 
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C. Covariance matrices and temperature-factors 

Besides identifying the elementary modes of excitation of a protein, it is important to calculate suitable thermodynamic 
quantities that characterize the protein dynamics once thermal equilibrium with the solvent has established. The main observable 
that can be calculated within the gaussian model is the degree of correlation of the displacement from the equilibrium (native 
position) of pairs of C Q 's. The thermodynamic average of the correlated displacements, are easily obtained from the inversion 
of the M matrix. In fact, after setting 1/(3 = KqT = 1, one has 

(x i ,»x j , v )=M£ ltv (13) 

where the brackets denote usual canonical thermodynamic averages with the weight of eqn. dl 2I >. The inverse matrix, M~^ v , is 
often referred to as the covariance matrix. Since it provides directional details about the correlated motion of pairs of residues 
we shall term it full covariance matrix to distinguish it from the reduced one discussed below which incorporates only a measure 
of the degree of correlation (but no directional information). 

The eigenvectors of the full covariance matrix represent the three-dimensional independent modes of structural distortion 
for the reference protein. The modes associated to the largest eigenvalues of M _1 are the slowest to decay in a dissipative 
dynamics and, hence, make the largest contribution to the mean-square displacement of a given residues. The latter quantity is 
straightforwardly calculated from eqn. (1131 : 

and can be directly connected to the temperature-factors (also called B-factors) measurements reported in X-ray or high- 
resolution NMR structural determinations |32]. 

It is worth remarking that the full covariance matrix provides information about the system elasticity not only in conditions of 
isolation but also when an external force, /j, is applied to a given residue, i. In fact, within the Gaussian approximation, the 
average displacement of the jth amino acid from its reference position due to the application of fi is given by 

A" 

As anticipated above, an important role in the analysis of molecular dynamical trajectories is also played by the reduced covari- 
ance matrix whose elements, Cy ■, are defined as 

C ij = {x i -x j ) = Y,Mr j ^ (16) 

In ordinary MD simulations, the thermodynamic average in (1 161 is replaced with the time average taken over the simulated 
trajectory (ergodicity assumption). Due to the fact that the C matrix is obtained from M^ 1 after a summation over the cartesian 
components, the linear size of C is equal to the number of protein residues, N, instead of 3 N as for 1 . This ten-fold reduction 
of information greatly simplifies the identification of significant correlations between residues motion. 

We conclude this section by discussing a technically important point. The inversion of the M matrix used in eqns. J13H16i . 
as well as a correct interpretation of equation d!2i are possible only within the subspace orthogonal to the eigenvectors of M 
associated with zero eigenvalues. Physically this corresponds to omit the structural modifications that cost no energy (zero 
modes). Due to the invariance of Hamiltonian ([0 under rotations and translations of the Cartesian reference frame, there will 
always be at least six zero modes. This number can, however, be larger if the M matrix is sparse. The presence of additional 
spurious modes in Gaussian network models that incorporate only C a coordinates is usually achieved by two means: either a 
reduction of the dimensionality of M (as in GNM) or by using large interaction cutoffs in the range 10-15 A( as in ANM). The 
model discussed here allows to use physically-appealing interaction cutoffs of the order of 7 A, as for GNM, and yet retaining 
the full three-dimensional detail in the M matrix. As will be shown later, these ingredients are necessary to capture the finer 
aspects of protein vibrations such as the correlation of residues' motion, while, consistently with previous studies, the overall 
mobility of individual residues is rather insensitive to the details of the model 13 3[ 13411 . 

In fact, we found that GNM, ANM and /3GM have a similar performance on the prediction of experimental B-factors. This 
was established using high-resolution, single-chain proteins taken from the non-redundant pdb-select list |35j. We restricted 
to proteins length between 50 and 200 residues and excluded from the comparison the first and last 5 residues to avoid biases 
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due to enhanced terminal mobility. Overall we selected 36 proteins determined with Xray and 3 1 with NMR. For simplicity we 
summarise the level of agreement as the average of the non-parametric rank correlation, t [ 36 ] . This analysis does not rely on the 
knowledge of the probability distribution from which the points (pairs of data) are taken. What matters is the agreement of the 
ranking of the points according to each of the two variables. In case of perfect [anti] -correlation the Kendall parameter t takes 
on the value 1 [-1] and usually provides a more stringent (and robust) measure than linear correlation! 37]. For the X-ray set and 
using GNM, r ranged from 0.37 to 0.39 for cutoffs in the range 7.5 - 15 A. For the same range /3GM gave 0.34 < r < 0.37, while 
for ANM 0.30 < r < 0.37 for interaction ranges 10-15 A. The comparison against NMR temperature-factors provided higher 
correlations as already observed in ref. 1 32]. For the same cutoff ranges reported above one has for GNM: 0.45 < r < 0.47, for 
/3GM: 0.46 < r < 0.48 and for ANM 0.42 < r < 0.48. 

In summary, the novel model discussed here allows to incorporate, in an effective Hamiltonian, not only backbone-backbone 
interactions but also backbone-sidechain and sidechain-sidechain ones. The sidechain degrees of freedom are entirely controlled 
by the C a positions. This has the crucial implication that the computational effort required to characterize the vibrational 
properties of the system is exactly the same as for models that incorporate only interactions between C a pairs. 

III. RESULTS AND DISCUSSION 

In this section we shall examine the extent to which suitable topology-based harmonic models can capture the details of the 
near-native vibrations of proteins in thermal equilibrium. Given the present impossibility to probe experimentally the various 
thermodynamical quantities discussed before, it is mandatory to choose as a reference the results of an all-atom molecular 
dynamics calculation performed on the HIV-1 protease, in complex with TIMMNR peptide model substrate fulfil . All MD 
dynamical averages were calculated after discarding an initial interval of a few ns over which the protease complexed to the 
model substrate was equilibrated I 1 Til [T2I1 . The configuration obtained of the end of the equilibration protocol was taken as the 
reference structure for the Gaussian approach. The inversion of the symmetric M matrix, necessary to characterize the system 
equilibrium dynamics, was done exploiting the Householder reduction |37] and took about 10 minutes on a personal computer. 

For the comparison, it is important to remark that although molecular dynamics studies can reproduce reliably a variety 
of experimental quantities they ultimately rely on empirical potentials which may be imperfectly parametrised. Besides this 
issue, it should also be noted that, usually, it is not easy to ascertain whether the simulated trajectory is sufficiently long that 
thermodynamic averages as in eq. (II 61 can be legitimately replaced with dynamical ones, though a recent study has indicated 
some valuable criteria for this purpose |6]. This potential limitations of general MD approaches should therefore be borne in 
mind also in the present context. 

The series of tests carried out to ascertain the consistency among MD results and the one of gaussian models include the 
comparison of temperature factors, covariance matrices and essential subspaces. The findings, summarised below, provide a 
direct and strong indication that the /3GM is apt for capturing several aspects of proteins' near-native vibrational dynamics with 
an accuracy that rivals with techniques based on all-atom potentials. 

The generalised energy function of eq. Q contains several parameters; one of these, the interaction amplitude of C a pairs, 
Vq A _ ca , can be conveniently taken as the energy unit. The other parameters are the interaction cutoff, R (which enters in the 
definition of the contact matrices) and the amplitudes of the Cp-Cp and Cp-C a interactions as well as the extra strength of the 
peptide term, k. For reasons of simplicity the strength of these interactions have been chosen of the same order as Vq A _ ca = 
k = 1, VcA-CB ~ Vcb-cb = 1/2- This choice was done for reasons of simplicity but is not particularly restrictive due to 
the fact that effective interactions between sidechains are expected to be of the same order as C a -C a ones [ 38 ] ; in addition the 
precise value of k will mostly impact on the high frequency vibrational spectrum of the system. Hence, it can be anticipated that 
the system elastic response should mostly depend on the value of the interaction radius, R, which has been accordingly varied 
in our analysis. 

We first discuss the possibility to predict the temperature factors encountered in molecular dynamics. This type of validation 
has been considered before by Doruker et al. in connection with the anisotropic gaussian model and reported a good consistency 
with dynamical simulations |J9j . 

As a measure of the agreement between the residues mean square fluctuations observed in MD and those predicted from the 
Gaussian models, see eqn. fl!4i . we considered the linear correlation coefficient. The degree of correlation as a function of the 
interaction cutoff radius for /3GM is shown in Fig. |2] 

The performance of the /3GM particularly stable beyond an interaction cutoff of about 8.0 A. Given the large number of residues 
in the system (198 for the protein and 6 for the substrate) over which the B-factors are calculated, it is certainly possible 
to conclude that the correlation coefficient approaching 0.7 visible in Fig. [2] is statistically significant. A more quantitative 
assessment of the statistical significance could be done using the Student's i-test or related methods l37ll . although such analysis 
usually rely on the assumption that the joint distribution of the correlated variables is binormal (which is not necessarily satisfied 
for B-factors). An alternative way is to recourse to the non-parametric Kendall test mentioned before 1 37], as recently proposed 
by Halle I3al . The Kendall correlation coefficient among the B-factors of the simulation and those of the j3 Gaussian model (for 
R w 7.5 A) is r « 0.61 and amply satisfies all ordinary criteria for statistical significance. 
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FIG. 2: Linear correlation coefficient for the temperature factors of the 204 residues in the HIV-1 PR/SUB complex. 



The successful comparison of the B-factors confirms the general agreement between the overall residues' motion in MD and 
the equilibrium dynamics predicted by Gaussian models|9|; in fact, for the same cutoff used above ang agsinst the same MD 
data of the HIV-1 PR/SUB complex, GNM provides a linear correlation coefficient of 0.61 while the Kendall parameter r is 
equal to 0.59. However, the finer details of such accord have not, to the best of our knowledge, been explored yet and hence 
become the focus of our subsequent analysis which is based on a comparison of the MD and /3GM covariance matrices. This test 
is particularly important due to the wealth of biological and chemical information that can been extracted from the covariance 
(essential dynamics) analysis l39l|40l[4lll. 

In the context of HIV-1 Pr ft3[ [T^t [H E3 El El El El El 

these motions have a direct mechanical bearing on the 
structural modulation of the active site, even though they are located remotely from it| 1 U M 2lll5ll . 

As a first case we consider the reduced covariance matrices. To allow a straightforward comparison of the theoretical and 
numerical results rather than working directly in terms of the reduced covariance matrix, it is useful to focus on the normalised 
version, which is dimensionless: 



Ca = , Jl = (17) 

The scatter plot of Fig. [3]summarises the degree of accord between the normalised covariance matrix of the MD simulation and 
that of the /3-Gaussian model, for a cutoff Of 7.5 A. The number of entries in the plot is about 2 • 10 4 , equal to the number of 
distinct entries in the 204x204 C\j matrix. To avoid introducing artificial biases in the correlation, the diagonal elements of the 
normalised matrices (which are all equal to 1) have been omitted from the plot. 

The linear correlation coefficient among the two sets of data is 0.80 (and is stable in the neig hborhood of R = 7.5 A). We 
do not attempt to provide a quantitative measure for the statistical significance of the linear correlation in Fig. [5] In fact, on 
one hand, as visible in Fig. [3] the joint distribution for covariance matrix elements is only approximately binormal, and hence 
the traditional tests of linear regression significance are of dubious applicability. On the other, the Kendall correlation measure 
requires to consider all possible pairs of points in the scatter plot: this makes the analysis impractical and disproportionate to the 
main goal which is to ascertain the existence of the accord between topological gaussian models and MD results. In fact, rather 
than measuring the correlation in absolute terms, we shall compare the accord of gaussian models and MD simulations against 
the degree of "internal" consistency of the simulated dynamical trajectory itself. 

We conclude the discussion of the scatter plot of Fig. |3]by mentioning that, in case of perfect correlation of two C\j sets, due 
to the normalization condition of eqn. ^] the data would align along the diagonal of the graph in Fig. [3] Interestingly, despite 
the scatter visible in the same figure, the interpolating line lies very close to the diagonal, having a slope of s — 0.97 (taking the 
(3GM covariance as the independent variable). This fact is useful in illustrating the effects of the cutoff, R, on the accord with 
MD covariance elements. While for R — lOAthe slope is still good, s = 1.05, it deteriorates for R = 15 A, where s = 1.64. 
This effect is even more pronounced if the Cp's are not included in the model. For example, for R = 15 A the observed slope 
was s — 3.41. 

The quantification of the self-consistency of MD dynamical trajectories is an extremely important issue since it can provide an 
a posteriori indication of whether the dynamical sampling of the phase-space was sufficiently to obtain reliable thermodynamic 
averages. The analysis usually starts from the calculation of two covariance matrices pertaining to the first and second halves of 
the MD trajectory. 
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FIG. 3: Scatter plot of corresponding entries of the covariance matrices obtained within the (3 Gaussian model (R — 7.5 A) and from the 14ns 
MD simulation on the HIV-1 PR/SUB complex. The linear correlation coefficient over the 2 • 10 4 data points is 0.80 . 

Ths two matrices could then be compared entry by entry, as done above. However, the most appropriate procedure is not 
to compare the corresponding matrix elements, but rather the physically-important (essential) eigenspaces, that is the linear 
spaces spanned by the eigenvectors of M -1 associated to the largest eigenvalues. A number of studies have suggested ways of 
measuring this consistency. 

The first method of comparison that we will be taken into account is the one introduced by Amadei et al 15(111 which focuses 
on the top n eigenvectors of the covariance matrices under comparison. These eigenvectors describe the most significant modes 
of vibration of the molecule in the three-dimensional space. We stress here that the covariance matrix considered here is not 
the reduced one of eq. dl6> . whose size is NxN, but is the full one (the M _1 matrix) of size 3Nx3N that contains the three- 
dimensional information about correlated motion of pairs of residues, see e.g. eqn. (II 31 . 

By denoting the top n eigenvectors of the two matrices under comparison as {if} and {£/}, the degree of overlap of the essential 
subspaces is defined as the root mean square inner product (RMSIP) of all pairs of eigenvectors in the two sets 1 50] : 



RMSIP = J-Y,\fk'0i\ 2 (18) 

V i>3 

Customarily, the analysis is restricted to the top n = 10 eigenvectors. We have measured the RMSIP in eqn. Jl 8I > when {77} and 
{v} come from the essential subspaces of the first and second halves of the 14ns MD trajectory of the HIV-1 PR/SUB complex. 
The calculated RMSIP value, eq. was 0.7 1 . Amadei et al. 15011 have also proposed a series of approximate tests to ascertain 
the statistical relevance of the observed overlap. Based on their analysis, we can conclude that the value obtained here, for the 
given system size of 204 amino acids, has a probability to have arisen by chance that is, by far, inferior to the conventional 
threshold of 1 %. This supports the fact that the MD trajectory was sufficiently long to contain significant physical information 
about the system equilibrium dynamics. 

Having in mind the level of internal consistency of the present reference MD trajectory we have turned to measuring the 
RMSIP between the essential spaces of the whole dynamical trajectory and those of the (3 Gaussian model. 

The resulting trend for the subspaces overlap is shown in Fig. @]as a function of the interaction cutoff, R. The best performance 
of the model is obtained for a cutoff of R ss 7.5 A. The corresponding overlap value of 0.68 is very close to the internal overlap 
of the MD trajectory. We wish to remark that such values of RMSIP are highly non-trivial due to the large size of the full 
covariance matrix (612x612). This implies that the probability to observe a given overlap, q, between two random unit vectors 
decreases extremely rapidly as q approaches 1 |50J. As a consequence, the MD simulation time required to reach a given target 
value for the internal RMSIP consistency, q, grows very repidly with q. 

This observation clarifies the utility of the Gaussian approach. With a modest computational investment, required by the 
diagonalization of a 3Nx3N matrix, one obtains a description of the protein essential dynamics that, within a molecular dynamics 
framework where all atom effective potentials are used, requires a considerably heavier computational investment. Further 
improvements over the /3GM performance are obviously possible within MD, but at the price of a rapidly growing computing 
time. 

The usefulness of the (3GM is further supported by yet another type of analysis, which concludes the series of tests carried 
out here. This last approach follows a more precise measure for the agreement of the essential subspaces recently introduced 
by Hess |6i]. The new measure is an improvement over the definition of eqn. Jl 8I > since it removes both the subjectivity of the 
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FIG. 4: Degree of correlation (root mean square inner product), see eqn. i 1 8i . of the essential subspaces of the MD simulation and of the 
gaussian models as a function of the interaction cutoff, R. The thick curve denotes the performance of the j3 Gaussian model, while the 
horizontal dotted line indicates the overlap of the essential subspaces of the first and second halves of the MD trajectory. 
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FIG. 5: Degree of overlap according to the measure introduced by Hess 1 6] of the essential subspaces of the MD simulation and of the gaussian 
models as a function of the interaction cutoff, R. The thick curve denotes the performance of the /3 Gaussian model, while the horizontal 
dotted line indicates the overlap of the essential subspaces of the first and second halves of the MD trajectory. 



choice of n, assigns more importance to the physically-relevant eigenspaces and deals correctly with the presence of spectral 
degeneracies. 

Unlike the measure of Amadei et al., the one of Hess is sensitive to the actual values of the eigenvalues of M~ l , Since the 
energy units of the gaussian models considered here has been chosen arbitrarily, a proper normalisation of the spectrum of the 
covariance matrices has to be carried out in order to use the measure of Hess for the comparison against MD. For this reason, 
the degree of overlap of the matrices was carried out after having uniformly rescaled the eigenvalues of M -1 so that the trace 
of M" 1 was equal to 1 for both systems. Physically this corresponds to a normalization of the average residues mean square 
displacement. 

The results for this analysis are shown in Fig. [5] The enhanced stringency of this test, which is extended to the whole 
vibrational spectrum, and not just to the top 10 modes, is reflected in an overall decrease of the overlap with respect to Fig. |4] 
This finer measure also reflects better than RMSIP the higher degree os inner consistency of the MD trajectory, as opposed to 
the MD-/3GM accord. Although the best reference for our model would be provided by a simulation run sufficiently long that 
the inner MD overlap approaches 1, this is not presently feasible due to the slow (approximately logarithmic) increase of the 
inner overlap with the length of the MD run. Despite this fact, the results fully confirm the previous conclusions namely that the 
(3 Gaussian model can predict, with good statistical confidence, equilibrium dynamical properties of proteins and, in particular, 
identify the relevant modes of vibration of the system. 

The reference system used here for the comparison between the model and MD simulations was chosen for both its biological 
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FIG. 6: Scatter plot of corresponding entries of the covariance matrices obtained within the /3 Gaussian model (R = 7.5 A) and from the 2.6ns 
MD simulation on the NGF-trkA complex IT3l . The linear correlation coefficient over the nearly 10 5 data points is 0.86 

significance and for the availability of MD data collected over the rather long simulation time. The j3 Gaussian model is, 
however, of general applicability, and to confirm the robustness of the strategy we have considered another biologically important 
reference system, the NGF-trkA complex. This is constituted by a protein dimer, the nerve growth factor (NGF), complexed with 
the tyrosine kinase A receptor (trkA); altogether the system comprises 431 residues. The dynamics of the complex in aqueous 
solution was recently simulated for a time span of 2.6 ns using all-atom potentials 1 16]. The resulting normalised covariance 
matrix was compared with the one obtained from the (3 Gaussian model. The linear correlation coefficient over the nearly 10 5 
corresponding distinct entries of the matrices is 0.86, as visible in Fig. [6] 

This result confirms the viability of the Gaussian approach to capture the details of the large-scale protein movements. However, 
due to the simplicity of the model interaction potentials (all pairs of C a 's and Cp's interact with the same strength) one may 
foresee that the model is not suitable for modelling the vibrational dynamics of proteins where electrostatic effects or disulfide 
bonds play an important role for native stability or funtionality. 

Biological implications 

In the MD calculations by Piana et al. the essential dynamics analysis has allowed to identify the sites that, despite 

being spatially distant from the active site have a strong mechanical influence on the structural modulation of the regions binding 
the substrate. We now compare the findings obtained with MD with those obtained with our model. The degree of mechanical 
coupling observed in the MD trajectory between the substrate motion and the HIV-1 protease subunits is visible in the top panel 
of Fig. The two curves in the plot represent the profile of the reduced covariance matrix, Cy, between the two central atoms 
of the peptide and the 198 protease residues. Interestingly, the regions that correlate significantly with the substrate motion are 
those that have been indicated as rather under-constrained by studies where the theory of rigidity has been applied to characterize 
the enzyme elasticity 1 5 1 ] . The two facts provide a consistent picture for the HIV- 1 PR mechanics since any functionally-relevant 
mechanical coupling must intuitively involve mobile (and hence under-constrained) regions. The identification of the mobile 
regions of HIV-1 protease has also been previously addressed with Gaussian network models in a series of studies which also 
allowed to identify the residues important for protein stability I 52ll53ll or otherwise taking part to crucially important networks 
of key native contacts ll54ll . 

The bottom panel of figure[7]represents the corresponding correlation profiles calculated within the /3GM for a cutoff R = 7.5 
A(which we take as an optimal value from previous analysis). The degree of agreement of the profiles across the two panels is 
remarkable and, the main difference appears to be due to an overemphasis of negative correlations in the gaussian profiles. 

In both sets of data one observes a strong positive correlation between the substrate motion and the regions 24-30 and 45-55. 
This direct mechanical coupling is of immediately interpreted due to the fact that the first region comprises the cleavage site 
while the second involves the tips of the protease flaps. 

The essential-space analysis of the MD trajectory revealed that the two regions embrace the substrate and involve it in a 
rotational "nutcracker-like" motion. As a consequence of this rotation, the regions near the flaps elbows, 37-41 and 61-73, 
undergo a counter-movement that results in a negative correlation with the substrate motion. This effect is clearly visible in the 
gaussian profiles of Fig. Q while it is less pronounced, but still significant, in the MD results. 

As remarked in the Theory section, the intimate connection between the linear response theory and the covariance matrix 
allows to conclude that a force applied in correspondence of sites around residues 40 and 63 should affect the protease-substrate 
coupling. This effect was indeed observed in the MD simulation where the motion of the substrate towards the cleavage site was 
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FIG. 7: The two curves in each panel indicate the degree of correlation of the motion between the two central atoms of the substrate and the 
198 residues in the subunits of the HIV-1 protease. The top panel reports the MD findings, while the bottom one pertains to the Gaussian 
model. 

strongly affected by the constraints on these regions (see Fig. 6 in ref. iHHh . 

It is by virtue of this mechanically-important coupling that it is possible to rationalize the emergence of mutations causing drug 
resistance in correspondence of sites far from the cleavage region (e.g. M63I, M46I-L and L47V). In fact, as the explicit MD 
calculation has shown, the high degree of such coupling between such sites and the cleavage is such that the detailed chemical 
identity of the former strongly influence the substrate binding affinity of the latter. In particular, the mutations observed in clinics 
1451 H6L Wa HsL 14911 are arguably the result of a chemical fine-tuning that retains the native functionality of the enzyme while 
decreasing its affinity for inhibiting drugs. 

The details of how the enzymatic reaction kinetics changes upon amino acid mutations is beyond the reach of the topological 
models presented here. The gaussian scheme, in fact, is entirely adequate to identify which sites influence mechanically the 
active site motion but, due to the fact that all amino acids (except for GLY which lacks the Cp centroid) are treated equally 
we cannot explore the ramifications of changing the amino acid identity into the cleavage-region and substrate coupling. The 
effect of one of such mutations (M46I) on substrate motion has instead been fully taken into account in ref. 1131 . Further 
improvements of the Gaussian model may be possible by optimizing the distance of the Cp centroids from their respective 
C Q 's (so to better capture the displacement of the sidechains centres of mass). The model could also be extended to include, 
at the simplest possible level, the effects of thermal denaturation through a self-consistent temperature-dependent weakening of 
the strength of the harmonic couplings, as done in refs. I32.I55I1 . 

IV. CONCLUSIONS 

We have examined the extent to which the dynamical properties of a protein in thermodynamic equilibrium can be accounted 
for through solvable models. The starting proint of our analysis is the quadratic approximation of the free energy landscape 
in terms of the deviations of amino acids from their reference positions in the known native state. We have adopted a novel 
description of the amino acids which allows to consider the presence of effective C a centroids whose degrees of freedom are 
entirely controlled by the C a atoms. As a result, the model that is considered, is able to account for the directionality of amino 
acid sidechains while retaining the same degree of complexity as models based on C a representation only. Various equilibrium 
quantities apt to characterize the most relevant modes of vibrations of proteins are considered. In particular, we focussed on 
(in increasing order of complexity and detail) the B-factors, the covariance matrices and the essential dynamical subspace. Our 
results have been compared against the analogous quantities obtained through a 14ns molecular dynamics simulation carried out 
on the HIV-1 PR enzyme in complex with a TIMMNR peptide substrate. 

As fas as overall equilibrium dynamical properties are concerned, the j3 Gaussian model provides a picture that is in remarkable 
agreement with the MD results. In fact, the essential subspace predicted theoretically appears to have a degree of consistency 
with MD results that is close to the "inner consistence" of a 14ns MD simulation with itself. 

This provides a strong indication that suitable quadratic models can provide a powerful and accurate tool for characterizing 
the vibrational motions of proteins near their native state while requiring only a modest investment of computational resources. 
Other important properties of protein dynamics and functionality that strongly depend on the sequence composition or on out-of- 
equilibrium conditions are, at the moment, beyond the reach of such simplified approaches considered here. For this reason we 
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believe that the Gaussian approach would be ideally used in conjunction with molecular dynamics by providing, prior to investing 
significant computational resources in all-atom simulations, a fast but accurate characterization of a protein's near-native motion. 
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